Abstract

Teacher efficacy has proven to be an important variable in teacher 
effectiveness. It is consistently related to positive teaching 
behaviors and student outcomes. However, the measurement of this
construct is the subject of current debate, which includes
critical examination of predominant instruments used to assess
teacher efficacy. The present study extends this critical
evaluation and examines sources of measurement error variance in
the Teacher Efficacy Scale (TES), historically the most frequently
used instrument in the area. Reliability generalization was
utilized to characterize the typical score reliability for the TES
and potential sources of measurement error variance across
studies. Other related instruments were also examined as regards
measurement integrity.

A Reliability Generalization Study of the Teacher Efficacy Scale
and Related Instruments

Perhaps one of the best documented attributes of effective 
teachers is a strong sense of efficacy. Researchers have 
repeatedly related teacher efficacy to a variety of positive 
teaching behaviors and student outcomes (cf. Tschannen-Moran,
Woolfolk Hoy, & Hoy, 1998). Teacher efficacy is strongly related
to achievement (Ashton & Webb, 1986; Moore & Esselman, 1992; Ross,
1992), students' own sense of efficacy (Anderson, Greene, &
Loewen, 1988), and student motivation (Midgley, Feldlaufer, &
Eccles, 1989). Teachers high in efficacy tend to experiment more
with methods of teaching to better meet their students' needs
(Guskey, 1988; Stein & Wang, 1988). Among other things,
efficacious teachers plan more (Allinder, 1994), persist longer
with students who struggle (Gibson & Dembo, 1984), and are less
critical of student errors (Ashton & Webb, 1986).

While the study of teacher efficacy has borne much fruit, the 
meaning and the appropriate methods of measuring the construct 
have become the subject of recent debate (Tschannen-Moran et al.,
1998). This dialogue has centered on two issues. First, based on
the theoretical nature of the self-efficacy construct (Bandura,
1977, 1997), researchers have argued that self-efficacy is best
measured within context regarding specific behaviors (see, e.g.,
Pajares, 1996). Second, the construct validity of scores from a
variety of instruments purporting to measure teacher efficacy and 
related constructs has been questioned (Coladarci & Fink, 1995; 
Guskey & Passaro, 1994). 

The Meaning and Measure of Teacher Efficacy 

Bandura (1977, 1997) presented self-efficacy as a mechanism 
of behavioral change and self-regulation in his social cognitive 
theory. Defining self-efficacy as "beliefs in one's capabilities to
organize and execute the courses of action required to produce
given attainments", Bandura (1997, p. 3) proposed that efficacy
beliefs were powerful predictors of behavior since they were
ultimately self-referent in nature and directed toward specific
tasks. The predictive power of efficacy has generally been borne
out in the research, especially when efficacy beliefs are measured
concerning specific tasks (cf. Pajares, 1996).

Many researchers have applied the concepts of Bandura's social
cognitive theory to teachers, among the first of whom were Ashton
and Webb (1982). They argued that two items previously used by
RAND researchers (Armor et al., 1976; Berman, McLaughlin, Bass,
Pauly, & Zellman, 1977) to study teacher efficacy actually
corresponded to Bandura's self-efficacy and outcome expectancy
dimensions of social cognitive theory. These dimensions have been
subsequently labeled personal teaching efficacy and general
teaching efficacy, respectively.

In an effort to further the study of teacher efficacy, Gibson 
and Dembo (1984) developed the Teacher Efficacy Scale (TES). The
TES was the first significant attempt to empirically develop a
data collection instrument to tap into this potentially powerful
variable in teachers. The outcome of Gibson and Dembo's study was
a 16-item instrument (reduced from 30 items) in six-point Likert
format consisting of two essentially uncorrelated subscales:
personal teaching efficacy (PTE, nine items) and general teaching
efficacy (GTE, seven items). The TES has subsequently become the
predominant instrument in the study of teacher efficacy, leading
Ross (1994, p. 382) to label it a "standard" instrument in the
field. Largely utilizing the TES, researchers have linked teacher
efficacy to multiple positive variables in teaching effectiveness
as well as positive student outcomes, including achievement
variables.

Other tests have also been developed to assess teacher 
efficacy and related constructs. For example, since self-efficacy 
is most appropriately measured in specific contexts, Riggs and 
Enochs (1990) developed a subject matter instrument to measure 
efficacy for teaching science, the Science Teaching Efficacy 
Belief Instrument (STEBI). This instrument was based on the TES
and also consisted of two largely uncorrelated subscales: personal
science teaching efficacy (PSTE) and science teaching outcome
expectancy (STOE). In most applications, the STEBI consists of 25
items with a five-point Likert scale. 

Furthermore, several tests have evolved from a slightly 
different, but related, theoretical orientation than Bandura's 
(1997) social cognitive theory. Specifically, Rotter's (1966) 
locus of control theory has played an important historical role in 
the conceptualization of teacher efficacy as a construct (cf. 
Tschannen-Moran et al., 1998). Intuitively, one's locus of control
orientation may impact one's perceived beliefs in his or her 
ability to execute actions that lead to success in a given 
attainment. Instruments in this locus of control tradition have 
informed the study of teacher efficacy from a construct validity 
standpoint (Coladarci & Fink, 1995) and are often used in teacher 
efficacy studies. 

Two of the more frequently used instruments in the Rotter 
(1966) tradition are the Teacher Locus of Control (TLC; Rose &
Medway, 1981) and the Responsibility for Student Achievement (RSA;
Guskey, 1981). The TLC consists of 28 forced-choice items that
present situations of student success (14 items) and student
failure (14 items). The two forced-choice options allow for either
an internal (teacher) or external (student) explanation for the
student outcome. The TLC yields two subscale scores, one
reflecting internal locus of control for student success (I+) and
the other internal locus for student failure (I-). Similarly, the
RSA consists of 30 items also presenting two possible explanations
(internal vs. external) for student success and failure. However,
the RSA asks respondents to weight each explanation by dividing
100 percentage points between the options. Scoring results in two
subscales, one assessing responsibility for student success (RSA+)
and the other responsibility for student failure (RSA-).

In an important article, Tschannen-Moran et al. (1998) 
reviewed the history and measurement methods for teacher efficacy. 
They both challenged the current conceptualization of teacher
efficacy as a construct and questioned the psychometric properties
of predominant instruments in the field. In particular,
Tschannen-Moran et al. presented a thoughtful critique of the
construct validity of scores from the TES (Gibson & Dembo, 1984).
They disagreed with Gibson and Dembo's claim that the PTE and GTE
subscales of the TES reflect Bandura's (1977) self-efficacy and
outcome expectancy dimensions of social cognitive theory. Other
researchers have made similar claims as regards construct validity
(cf. Coladarci & Fink, 1995; Guskey & Passaro, 1994). Primarily,
these criticisms have focused on the GTE subscale, while the PTE
subscale has been less maligned.

Purpose 

Given the potential value of teacher efficacy as a construct 
and in light of the current controversy over how to best measure
teacher efficacy, it is relevant to examine in greater detail the
psychometric properties of the TES and related instruments. Recent
examinations have concerned themselves with validity issues
(Coladarci & Fink, 1995; Guskey & Passaro, 1994), but none has
specifically addressed the ability of these tests to yield
reliable scores. The study of teacher efficacy could benefit from
an understanding of the extent to which these instruments yield
reliable scores and what factors contribute to variation in the
reliability estimates. The purpose of the present paper is to
examine the TES and related instruments noted above as regards
score reliability. Reliability generalization was used as a
meta-analytic framework to examine sources of measurement error
variance across studies using these instruments and to
characterize typical score reliabilities for given tests
(Vacha-Haase, 1998).

Score Reliability and Reliability Generalization 

In order to contextualize the current study, it is important 
to emphasize that scores, not tests, are either reliable or 
unreliable (Thompson, 1994; Vacha-Haase, 1998). As correctly noted
by Gronlund and Linn (1990), "Reliability refers to the results
obtained with an evaluation instrument and not to the instrument
itself. Thus it is more appropriate to speak of the reliability of
'test scores' or the 'measurement' than of the 'test' or the
'instrument'" (p. 78, emphasis in original). Unfortunately, the
incorrect but common phraseology concerning the "reliability of
the test" leads many to incorrectly assume that reliability
inheres in tests rather than scores, and results in researchers
often failing to examine score reliability for their data.

Many factors affect the degree to which a given test will yield
reliable scores for a given administration, not the least of which
are the characteristics of the sample measured. For example,
Thompson (1994) observed: "The same measure, when administered to
more heterogeneous or more homogeneous sets of subjects, will
yield scores with differing reliability" (p. 839). This may occur
because reliability estimates are heavily influenced by total score
variability. In terms of classical measurement theory (holding the
number of items on the test and the sum of item variances
constant), increased variability of total scores suggests that we
can more reliably order people on the trait of interest, and thus
more accurately measure them. This assumption is made explicit in
the test-retest reliability case, when consistent ordering of
people across time on the trait of interest is critical in
obtaining high reliability estimates.
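
The dependence of internal consistency estimates on total score
variance can be illustrated with coefficient alpha. The following
is a minimal sketch and not part of the original analysis; the
function name coefficient_alpha and all numeric values are
hypothetical, chosen only to show that, with the number of items
and the sum of item variances held constant, a larger total score
variance produces a larger alpha.

```python
def coefficient_alpha(n_items, sum_item_variances, total_variance):
    """Coefficient alpha computed from summary statistics.

    alpha = (k / (k - 1)) * (1 - sum of item variances / total score variance)
    """
    if n_items < 2:
        raise ValueError("alpha requires at least two items")
    if total_variance <= 0:
        raise ValueError("total score variance must be positive")
    ratio = sum_item_variances / total_variance
    return (n_items / (n_items - 1)) * (1 - ratio)


# Hypothetical 9-item subscale whose item variances sum to 9.0.
# Only the total score variance differs between the two samples.
homogeneous = coefficient_alpha(9, 9.0, 20.0)    # restricted total variance
heterogeneous = coefficient_alpha(9, 9.0, 40.0)  # larger total variance
print(round(homogeneous, 3), round(heterogeneous, 3))
```

Under these assumed values, the more heterogeneous sample yields
the higher estimate, which is the pattern the classical model
predicts.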

Unfortunately, researchers often fail to cite reliability 
estimates for their data, and often assume that estimates from 
prior studies or test manuals suffice for their current study 
(Vacha-Haase, Kogan, & Thompson, in press). However, as Pedhazur
and Schmelkin (1991) noted, "Such information may be useful for
comparative purposes, but it is imperative to recognize that the
relevant reliability estimate is the one obtained for the sample
used in the study under consideration" (p. 86). Empirical studies
confirm that very few researchers actually report reliability
estimates for their data (cf. Caruso, 2000; Vacha-Haase, 1998; Yin
& Fan, 2000). For example, Yin and Fan observed that only 7.5% of
articles employing the Beck Depression Inventory reported precise
reliability estimates for the data in hand.

Because sample characteristics can affect score reliability,
researchers who only report reliability from prior studies or
test manuals should at least make explicit comparisons of their
sample's composition and variability with those of the sample
referenced in the prior study. As Dawis (1987) explained, "Because
reliability is a function of sample as well as of instrument, it
should be evaluated on a sample from the intended target
population - an obvious but sometimes overlooked point" (p. 486).
To the extent that the current sample differs from the referenced
sample, the current reliability estimates may also differ.
Regarding this comparison between samples, Thompson and
Vacha-Haase (2000) suggested that,

    The crudest and barely acceptable minimal evidence of score
    quality in a substantive study would involve an explicit and
    direct comparison (Thompson, 1992) of (a) relevant sample
    characteristics (e.g., age, gender), whatever these may be in
    the context of a particular inquiry, with the same features
    reported in the manual for the normative sample or in earlier
    research and (b) the sample score SD with the SD reported in
    the manual or in other earlier research. (p. 190, emphasis in
    original)

Vacha-Haase et al. (in press) termed the process of using a 
prior study's reliability estimates for one's own data 
"reliability induction", suggesting that researchers inductively 
generalize from specific instances to a broader conclusion. That 
is, researchers assume that because reliable scores were obtained 
in prior instances, reliable scores will be obtained in entirely 
new data (which, of course, is not necessarily the case).
Vacha-Haase et al. argued that reliability induction is reasonable
only when the sample composition and variability of the current
and referenced samples are comparable. Furthermore, they
presented data illustrating the incongruence between current and 
prior samples for two tests. 

Since reliability may, and does, vary upon different 
administrations of a test, Vacha-Haase (1998) employed a meta- 
analytic method called reliability generalization that allows 
examination of the variability of score reliability across 
studies. In addition, coded study characteristics (such as 
composition and variability) can be used as potential predictors 
of reliability variation, thereby providing some evidence of 
sampling conditions under which reliability may be more or less
tenable. A modified correlational version of this method was
employed in the present study regarding the TES and related
instruments.

Method 

Sample of Instruments and Articles 

Four instruments were selected based on their frequency of 
use in the study of teacher self-efficacy (Bandura, 1977) and 
teacher locus of control (Rotter, 1966). In the self-efficacy
tradition, these instruments included the TES and the STEBI. In 
the locus of control tradition, the TLC and RSA were examined. 

All of these instruments consist of two subscales (described
above). Since score reliability is most appropriately examined for
individual subscales (constructs), the subscales were the focus of
analysis.

Searches of the PsycINFO and ERIC databases were conducted
for articles published from 1981 through February 1999. The
primary search in both databases was broad and used the keywords
"teacher AND efficacy". Secondary searches, using the name
of each test, were conducted to ensure selection of articles using
the other tests. In total, the PsycINFO search yielded 639
articles and the ERIC search yielded 975 articles and
conference presentations. Since the clear majority of relevant
articles were found in both databases, only conference
presentations were utilized from the ERIC search.

The selected articles and presentations (herein referred to
as articles) were read and retained if they included either a 
reported reliability coefficient for the data in hand from a 
subscale or if the authors reported the mean, standard deviation, 
and number of items in the subscale. All articles that were false 
hits, in non-English languages, or not obtainable were eliminated. 
In addition, articles that used one of the tests but either did
not report the necessary information or did not meaningfully
report reliability (e.g., reporting only a range of reliability
estimates or reliability for combined subscales) were also
eliminated. These selection procedures left 52 articles for
further analysis. However, these articles frequently reported
score reliabilities or means and standard deviations for multiple
groups (e.g., treatment and control, male and female), yielding
213 useful observations. Of these 213 entries, 86 reliability
coefficients (all internal consistency estimates) were available
for the four instruments.

As expected, the TES was the most frequently used test, and the
majority of reliability estimates (25 for PTE, 21 for GTE) were
from scores on TES subscales. Subscales on the other tests had
far fewer reported estimates from data in hand (13 PSTE, 11 STOE,
3 I+, 3 I-, 5 RSA+, 5 RSA-).

Coding of Study Characteristics 

The 52 articles selected were each read and 15 study
characteristics were coded. Of the 52 articles, 43 were coded
independently by two raters. Interrater reliability was
examined by calculating the percent of perfect agreement between
raters out of all possible ratings. This percentage was computed
for each of the 15 coded variables and ranged from 76.09% to 100%
agreement (M = 91.35%, SD = 6.92%). In addition, accuracy of
coding was checked by a third rater, who examined and corrected
observed discrepancies between the independent raters. The third
rater also audited the 9 articles that were not dually coded and
made minor corrections.
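
The percent-agreement index described above can be sketched as
follows. This is an illustration only: the function name
percent_agreement and the example codes are hypothetical and are
not data from the present study.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of ratings on which two raters assigned identical codes."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must code the same set of items")
    if not ratings_a:
        raise ValueError("at least one rating is required")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


# Hypothetical codes for one study characteristic
# (0 = preservice, 1 = inservice) across eight articles.
rater_1 = [0, 1, 1, 0, 1, 1, 0, 1]
rater_2 = [0, 1, 0, 0, 1, 1, 0, 1]
agreement = percent_agreement(rater_1, rater_2)
print(agreement)
```

Percent agreement does not correct for chance agreement, so it is
best read as a descriptive check on coding consistency.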

Although multiple study characteristics were coded, the small 
percentage of studies actually reporting reliability coefficients 
(all internal consistency estimates) limited the number of 
variables that could be used for analysis. As such, selected 
bivariate correlational analyses were conducted in lieu of 
multiple regression. Variables were selected for use in the 
present study based on their potential for capturing differences 
in sample homogeneity as regards the variable of interest. These 
variables were: 

1. Teacher experience: 0 for preservice, 1 for inservice. 

2. Teaching level: 0 for elementary, 1 for mixed levels. (Note: 
Other teaching level contrasts were coded, including elementary 
versus secondary. However, no variance existed in these contrasts 
due to limited score reliability reporting for data in hand.) 

3. Teaching area: 0 for regular/general education and 1 for
other, including special education.

4. Gender homogeneity: Coded as proportion of the number of 
persons in the majority gender to total sample size. As such, 
this variable ranges from .50 to 1.00. This proportion measures 
gender homogeneity, regardless of whether that homogeneity was due 
to females or males. 

5. Sample size. 

6. Number of items in subscale. 

7. Standard deviation of subscale: When standard deviations were
given for the sum of participants' responses, these standard
deviations were converted to the average item level.

8. Mean of subscale: When means were given for the sum of 
participants' responses, these means were converted to the average 
item level. 
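
Two of the coding rules listed above (variables 4, 7, and 8) are
simple arithmetic and can be sketched as follows. The function
names and example values are hypothetical; the sketch assumes that
converting a sum-score mean or standard deviation to the average
item level means dividing by the number of items in the subscale.

```python
def gender_homogeneity(n_female, n_male):
    """Proportion of the sample in the majority gender (.50 to 1.00)."""
    total = n_female + n_male
    if total <= 0:
        raise ValueError("sample must contain at least one person")
    return max(n_female, n_male) / total


def to_item_level(sum_score_statistic, n_items):
    """Rescale a mean or SD of summed scores to the average-item metric."""
    if n_items <= 0:
        raise ValueError("number of items must be positive")
    return sum_score_statistic / n_items


# Hypothetical sample: 45 women and 15 men on a 9-item subscale
# with a sum-score mean of 36.0 and a sum-score SD of 5.4.
homogeneity = gender_homogeneity(45, 15)
item_mean = to_item_level(36.0, 9)
item_sd = to_item_level(5.4, 9)
print(homogeneity, item_mean, round(item_sd, 2))
```

Because the proportion uses the majority gender regardless of which
gender it is, a sample of 45 men and 15 women would receive the
same code.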

Estimating Reliability 

Reliability was estimated with KR-21 (Kuder & Richardson,
1937) for the dichotomously scored Teacher Locus of Control
subscales (I+ and I-). KR-21 requires knowledge of the mean,
standard deviation, and number of items on the test. The formula
assumes that all item difficulties are equal and, as a matter of
degree, the coefficient may be expected to underestimate
reliability when this assumption is not met. Because only two
cases using the TLC reported both reliability from data in hand
and means and standard deviations, the accuracy of the KR-21
estimate could not be evaluated against reported coefficients.
Because KR-21 is likely to underestimate reliability, the KR-21
estimates were used as the reliability estimates for all analyses
concerning the subscales of the TLC. This was necessary to ensure
that the reliability estimates maintained their relative position
in the distribution, despite potentially underestimating score
reliability.

To obtain the uncorrected total score variance estimates 
necessary for KR-21, we converted the reported standard deviation 
with the formula: 

$$\sigma^2 = \frac{SD^2 \,(n - 1)}{n}$$

where SD is the standard deviation of total scores reported for
the subscale and n is the sample size for which the SD was
reported. This estimate was then used in the KR-21 formula. It
should be noted that KR-21 was not applied to the other subscales
since their response formats were non-dichotomous. In its
traditional version as reported by Kuder and Richardson (1937),
KR-21 does not generalize to this type of data (e.g., Likert
scales).
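
The estimation steps in this section can be sketched in code. The
sketch uses the standard KR-21 expression,
KR-21 = [k / (k - 1)] * [1 - M(k - M) / (k * variance)], where k is
the number of items and M is the mean total score. It assumes that
the reported SD was computed with an n - 1 denominator, as the
conversion formula above implies. The function names and the
example values are hypothetical and are not data from the present
study.

```python
def uncorrected_variance(sd, n):
    """Convert a reported SD (n - 1 denominator) to an uncorrected variance."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return sd ** 2 * (n - 1) / n


def kr21(mean, variance, n_items):
    """KR-21 from the total score mean, total score variance, and item count."""
    if n_items < 2:
        raise ValueError("KR-21 requires at least two items")
    if variance <= 0:
        raise ValueError("total score variance must be positive")
    k = n_items
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))


# Hypothetical 14-item dichotomous subscale: mean 9.0, SD 3.0, n = 50.
variance = uncorrected_variance(3.0, 50)
estimate = kr21(9.0, variance, 14)
print(round(variance, 2), round(estimate, 3))
```

Only the mean, the standard deviation, the sample size, and the
number of items are needed, which is why the estimate can be
recovered from articles that report no reliability coefficient.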

Total Score Variance and Reliability 

Because total score variance is a central component of
internal consistency reliability estimates, correlational analyses
were conducted for all subscales between uncorrected variance
estimates and reported (or, for the TLC, estimated) score
reliabilities. Uncorrected variances were computed at the item
level using the above noted formula.
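
A bivariate analysis of this kind can be sketched with a plain
Pearson correlation. The function name pearson_r and the five
paired observations are hypothetical; they merely stand in for
item-level variances and the reliability coefficients associated
with them.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with n >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


# Hypothetical item-level variances and score reliabilities
# from five administrations of one subscale.
variances = [0.30, 0.45, 0.50, 0.62, 0.80]
reliabilities = [0.68, 0.72, 0.75, 0.79, 0.84]
r = pearson_r(variances, reliabilities)
print(round(r, 2))
```

With so few observations per subscale, a coefficient of this kind
is unstable and should be interpreted cautiously.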

Results 

Figure 1 characterizes the distributions of reliability 
estimates with boxplots. Table 1 presents descriptives for the 
subscales. Examination of Figure 1 indicates considerable 
variation of score reliability between subscales and within some 
subscales, particularly the two subscales of the TES (PTE and GTE)
and the Internal Failure (I-) subscale of the TLC. Reliabilities
had ranges of .26 or higher on each of these subscales,
representing at least 26% fluctuation in true score variance from
minimum to maximum estimates. Figure 1 also suggests that several
subscales were relatively consistent in their ability to yield
reliable scores, particularly the PSTE subscale of the STEBI and
the Internal Success (I+) subscale of the TLC.

INSERT FIGURE 1 AND TABLE 1 ABOUT HERE

The two efficacy measures, TES and STEBI, performed similarly
as regards score reliability. In general, the PTE and PSTE 
subscales yielded more reliable scores than the GTE and STOE 
subscales. This outcome was expected since the STEBI was modeled 
after the TES. Interestingly, both subscales purporting to 
measure personal efficacy (PTE and PSTE) yielded reliabilities 
that were outliers from the distribution of reliability estimates.
This finding illustrates the fact that reliability is a function
of scores, not tests, and that estimates may vary considerably
upon different administrations of the test. The PSTE subscale, for
example, yielded stable score reliabilities with three exceptions,
one of which (.74) was unexpectedly low relative to the
distribution. While all of the estimates for PSTE were reasonably
acceptable, the lowest estimate for PTE was marginal and several
from GTE and I- were quite low. Again, these estimates illustrate
that score reliability is not a stable characteristic that is
"indelibly and unalterably stamped into test booklets [or prior
published research] during the printing process" (Thompson &
Vacha-Haase, 2000, p. 177). Instead, reliability can be affected by
other study characteristics, not the least of which are sample
attributes.

Table 1 also presents correlations between selected study 
characteristics and reported score reliabilities. Because so few 
authors reported score reliabilities for the data in hand, only 
correlational analyses were possible in the present study as 
opposed to a more full-fledged reliability generalization using 
more advanced methods. Results indicated that different subscales 
were related to different study characteristics, suggesting that 
study characteristics may have had differential impact on 
reliability estimates. It is important to note, however, that
these results are tentative and limited by the dearth of score
reliability estimates reported for data in hand.

Teacher experience and teaching level were negatively related
to reliability on both TLC subscales; reliability estimates were
lower for inservice teachers and teachers of mixed teaching levels.
It might be expected that preservice teachers would be more
heterogeneous as regards locus of control (thereby yielding more
reliable scores), not having had the experience of teaching to
solidify their perceptions of student success and failure.
However, one might also expect mixed teaching levels to be more
heterogeneous than the elementary level. If so, one would expect
higher reliabilities for the mixed group, which did not occur.

Teaching area was unrelated to reliability estimates.
However, gender homogeneity was consistently negatively related to
score reliability, with the exception of the RSA. The high
positive correlations for RSA are likely artifacts of only having
three observations. Although the gender homogeneity correlations
are weak to moderate, the consistent negative relationship to
score reliability suggests lower reliability may be obtained from
samples with larger proportions of one gender.

The relationship between sample size and reliability fluctuated
in both magnitude and direction. In a study of Big Five factors of
personality, Viswesvaran and Ones (2000) reported no relationship
between reliability coefficients and sample size. The present
findings are inconsistent with this prior research but are unclear
as regards any predictable relationship between the variables.

Correlations between reported subscale variances and
reliability coefficients were all high and positive. As noted, score
variance is a critical component of classical test theory 
reliability estimation. Coefficient alpha tends to increase as 
total score variance increases. The present findings supported 
this premise. 

Finally, all correlations (except one) between the number of 
items on the subscale and the reliability estimate were also 
positive, illustrating the common understanding that as the number 
of items on a test increases, reliability estimates are also 
likely to increase. However, the one negative correlation 
indicates that this common understanding is not always correct. 
Reliability is impacted by factors beyond the length of the test 
such that shorter forms of tests may actually yield more reliable 
scores. As Thompson (1990) noted, "Notwithstanding erroneous 
folkwisdom to the contrary, sometimes scores from shorter tests 
are more reliable than scores from longer tests" (p. 586).
Vacha-Haase (1998) cites the Bem Sex Role Inventory as an example
of this phenomenon.
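
The common expectation that longer tests yield more reliable
scores is usually formalized with the Spearman-Brown prophecy
formula, which is not used in the present study and is offered here
only as an illustration. The sketch assumes that added or removed
items are parallel to the existing ones, an assumption that real
items often violate, which is one reason the expectation can fail
in practice. The function name and values are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor.

    Assumes strictly parallel items; real items may not behave this way.
    """
    if length_factor <= 0:
        raise ValueError("length factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical subscale with a score reliability of .70.
doubled = spearman_brown(0.70, 2.0)  # twice as many parallel items
halved = spearman_brown(0.70, 0.5)   # half as many parallel items
print(round(doubled, 3), round(halved, 3))
```

The projection describes what would happen under the parallel-items
assumption, not what must happen for any particular set of scores.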

A potential "reliability induction" analysis of the TES,
comparing the current study's reported standard deviations with the
variability of subscales given in the original Gibson and Dembo
(1984) article, was not possible because, unfortunately, no
standard deviations were reported in the Gibson and Dembo article.
Thompson and Vacha-Haase (2000, p. 190) noted that
the "crudest and barely acceptable minimal evidence of score
quality" would be an explicit comparison of the current sample's
composition and variability with those of the sample referenced
for the prior reliability coefficient. Such comparisons are
problematic (indeed, impossible) when insufficient information is
reported concerning test construction. Of course, the best
evidence of adequate score reliability for one's own data is to
actually compute it - a process that takes at least a minute with
modern computing capabilities!

Discussion 

Considerable variability was observed between instruments as
regards their ability to yield reliable scores. Mean
reliability coefficients tended to be acceptable for the 
instruments, although what is acceptable is a somewhat arbitrary 
decision and ultimately determined by the context of a study. 
Potential fluctuation of reliability coefficients was also evident 
within all instruments, particularly for the TES's personal and 
general teaching efficacy subscales and the TLC's internal failure 
subscale. Because reliability may fluctuate, researchers should 
always examine the reliability of their data in hand and report 
it. It is insufficient to assume that a test will yield reliable
scores solely because reliable scores have been obtained in the
past. An even more egregious error is to assume a test will yield
reliable scores when reliability has been marginal in the past,
such as for the general teaching efficacy subscale of the TES (see
Figure 1). Furthermore, even in substantive studies, reporting
reliability coefficients is critical because effect sizes are
attenuated by the observed reliabilities (Reinhardt, 1996).
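
The attenuation referred to here can be illustrated with the
classical test theory relation between an observed correlation and
the correlation between true scores. The relation is a standard
result and is shown only as an illustration; the symbols and the
numeric values are assumptions introduced for this example, with
r_xx and r_yy denoting the score reliabilities of the two measures
and rho the correlation between true scores.

```latex
r_{xy} = \rho_{T_x T_y} \sqrt{r_{xx} \, r_{yy}}
% Hypothetical example: true-score correlation .50,
% score reliabilities .70 and .80
r_{xy} = .50 \sqrt{(.70)(.80)} \approx .50 \times .748 \approx .37
```

Under these assumed values, an effect of .50 at the true-score
level would be observed as roughly .37, which is why reliability
for the data in hand bears directly on the interpretation of effect
sizes.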

Regarding the TES, the personal teaching efficacy subscale 
tended to maintain stronger score integrity than the general 
teaching efficacy subscale. This finding suggests that the general 
teaching efficacy subscale may be susceptible to measurement error 
problems in addition to its questioned construct validity 
(Coladarci & Fink, 1995; Guskey & Passaro, 1994; Tschannen-Moran 
et al., 1998). Accordingly, use of the general teaching efficacy 
subscale as a measure of teacher efficacy is questionable practice 
at best. Correlational analyses revealed no clear patterns 
regarding the relationship between reliability coefficients and 
study characteristics for the TES. However, the failure of many 
authors to report reliability information limited the number of 
characteristics examined and sensitivity of the analyses used. 
Therefore, the present results are inconclusive regarding the 
relationship between study characteristics and score reliability 
on the TES. What is clear, however, is that total score variance 
was consistently related to reliability coefficients. Range
restriction for homogeneous samples is likely to lower reliability
estimates and appeared to do so in the present study. The negative
relationship between reliability and gender homogeneity also 
provided limited evidence of this possibility. 

Because the STEBI was developed from the TES, its performance 
was similar to the TES. Looking at the results for both the TES 
and the STEBI in Figure 1, it is clear that the personal teaching
efficacy subscales tend to yield less measurement error in their 
scores. The tests consistently yielded lower score reliabilities 
for the general teaching efficacy or outcome expectancy subscales. 
These findings are consistent with the current debate surrounding 
the TES and the personal and general teaching efficacy constructs. 
While prior debate has focused on construct validity of scores 
from these tests (Tschannen-Moran et al., 1998), the present study 
suggests that the psychometric difficulties of the general 
teaching efficacy subscales also extend to
measurement error. Furthermore, with one subscale exception, the
TES yielded the most variable reliability coefficients of all the
instruments.

In sum, while the personal teaching efficacy subscale tended 
to include less measurement error in its scores, the reported 
reliability estimates were quite variable across studies with low 
estimates in the marginal range. Coefficients from the general 
teaching efficacy subscale were consistently lower and also highly
variable. The TES, if it is to see continued use in the study of
teacher efficacy, likely should undergo revision with an eye to 
measurement integrity. Given the debate over the construct 
validity and current evidence of poor reliability of scores for 
the general teaching efficacy subscale, the subscale should 
potentially be abandoned and replaced with efforts to more 
reliably measure the outcome expectancy dimension of Bandura's 
(1997) social cognitive theory. Tschannen-Moran et al. (1998) have 
presented a new model of teacher efficacy that may serve to advise 
development of new measurements in the field. Henson, Bennett,
Sienty, and Chambers (2000) reported some support for this model
and its application of the relevant constructs. Researchers of
teacher efficacy would do well to pursue measurement strategies in 
this direction, and if tests are developed to aid the process, 
researchers should be certain to examine score reliability for 
data in hand, even in substantive studies. After developing their 
tests, researchers would also do well not to then erroneously 
claim that their "test is reliable".

Figure 1. Boxplot of reliability estimates for each subscale
from four instruments.

Note. PTE = Personal Teaching Efficacy (TES), GTE = General
Teaching Efficacy (TES), PSTE = Personal Science Teaching
Efficacy (STEBI), STOE = Science Teaching Outcome Expectancy
(STEBI), I+ = Internal Success (TLC), I- = Internal Failure
(TLC), RSA+ = Responsibility for Success (RSA), RSA- =
Responsibility for Failure (RSA).